ABSTRACT



A computer system for performing searches on a collection 
of information includes a mechanism through which results 
from a search query are ranked according to user-specified
relevance factors to allow the user to control how the search 
results are presented. The relevance factors are applied to the 
results achieved for each query. That is, each item returned 
by the search has a set of attributes. Each of these attributes 
is assigned a weight according to the specified relevance 
factors. These weights are combined to provide a score for 
the item. Search results are provided to the user, ordered
according to scores. The application of the relevance factors 
does not alter the query performed on the collection of
information. 

12 Claims, 9 Drawing Sheets 
COMPUTER SYSTEM WITH USER-
CONTROLLED RELEVANCE RANKING OF 
SEARCH RESULTS 

FIELD OF THE INVENTION 

The present invention is related to the searching of
collections of information. In particular, the present invention
is related to methods for ranking items received as the
result of a search of a collection of information.

BACKGROUND OF THE INVENTION 

There are generally two methods used for searching for 
items within a collection of information, such as a database 
containing multiple information sources such as text docu- 
ments. The first method commonly is called a Boolean 
search which performs logical operations over items in the 
collection according to rules of logic. Such searching uses 
conventional logic operations, such as "and," "or," or "not,"
and perhaps some additional operators which imply ordering 
or word proximity or the like or have normative force. 
Another method is based on a statistical analysis to deter- 
mine the apparent importance of the searched terms within 
individual items. The search terms accrue "importance" 
value based on a number of factors, such as their position in 
an item and the context in which they appear. For example, 
a search term appearing in the title of a document may be 
given more weight than if the search term appears in a 
footnote of the same document. There are several forms,
variations and combinations of statistical and Boolean 
searching methods. 

One problem with searching large collections of information
of many items (e.g., records, text documents, etc.) is that
a particular query may provide search results which include
items irrelevant to what the framer of the search has in mind
or items which are too numerous for all to be reviewed.
Using a large public computer network like the Internet to 
search a database of information available on the network, 
search results may be too numerous or of little value to the 
user and the search engine may be very frustrating to use. 
While the search results may be presented in an order 
according to some rule, such as by displaying the newest 
item first, by placing the items in alphabetical order, or by 
ranking the items according to some score assigned to the 
item, most search engines do not provide the capability for 
a user to control how search results are presented to a user 
or, at best, allow only minimal control in a manner that 
actually changes the query performed and hence affects the 
search results. 

SUMMARY OF THE INVENTION 

The present invention provides a mechanism through 
which results from a search query are ranked according to 
user-specified relevance factors to allow the user to control 
how the search results are presented, e.g., their order. The 
relevance factors are applied to the results achieved for each 
query. That is, each item returned by the search has a set of 
attributes. Each of these attributes is assigned a weight 
according to the specified relevance factors. These weights 
are combined to provide a score for the item. The scores of 
the items control the presentation of search results. The 
application of the relevance factors does not alter the query 
performed on the collection of information. 

In one embodiment, each relevance factor is assigned a 
base value. These base values and an associated bonus are 
applied to a set of items retrieved by the search query to 
obtain a score for each item. By allowing the user to specify
the base values, the relevance metric is tunable to the needs 
of the user. 

One factor which may be used to affect the relevance
score of an item includes the location of a search term in the
item. For example, with structured documents such as those
written in SGML, HTML, or other markup languages, the
structural information about the document may enclose
search terms and may result in a document being considered
more relevant than another. The position of search terms in
the body of a document, called salience, also may be used.
For example, a search term appearing in the first sentence of
the first paragraph of a field in a document may have greater
salience than the same term found in the last sentence of a
last paragraph of the same field. The frequency of occurrence
of a search term in an item, or of the search term in all
items, the number of search terms found in an item, the
ordering of search terms in the item, the distance between
terms in an item, and prefixed instance or stemming are some
of the factors which may be used to compute a relevance
score for a given result returned by the search engine. Other
possible factors include, but certainly are not limited to, the
recency of the item or the location of the item within a file
system or directory of files.

Accordingly, one aspect of the present invention is a
computer system for providing user-controllable relevance 
ranking of search results from a query on a collection of 
items of information. The computer system includes a 
relevance determination module having a first input for 
receiving a set of search results from a query indicating 
items in the collection matching the query, a second input for 
receiving an indication of relevance factors specified by a 
user, and a third input for receiving information about the 
items in the set of search results to which relevance factors 
may be applied. This module has an output for providing an 
indication of a score indicative of relevance for each of the 
items in the set of search results. A sorting module has an 
input which receives the score associated with each item and 
an indication of the set of search results, and an output 
providing to the user an indication of the items in the set of 
search results in an order ranked according to the relevance 
score of each item. 

Other aspects of the invention include the process per- 
formed by the computer system to apply the relevance 
factors to the search results to provide a score for each item 
in the search results. Another aspect of the invention is a 
client computer and the process performed by the client 
computer to communicate with a database server to provide 
relevance factors and receive the ranked search results.
Another aspect of the invention is a server computer and the 
process performed by the server computer to receive and 
process a query and relevance factors from a client computer 
to produce relevancy ranked search results. 

BRIEF DESCRIPTION OF THE DRAWING

In the drawing,

FIG. 1 is a block diagram of one embodiment of the
present invention;

FIG. 2 is a block diagram of a second embodiment of the
present invention;

FIG. 3 is a block diagram of an embodiment of the present
invention using a client computer and a server computer
interconnected over a computer network;

FIG. 4 is a flow chart describing how the relevance
determination module determines a score for each item
retrieved from a query;

FIG. 5 illustrates a graphical user interface for a browser
for permitting a user to input a search query and values for
relevance factors;

FIG. 6 illustrates another embodiment of the graphical
user interface; and

FIGS. 7-9 are illustrations of search results presented by
one embodiment of the invention.

DETAILED DESCRIPTION

The present invention will be more completely understood
through the following detailed description which
should be read in conjunction with the attached drawing in
which similar reference numbers indicate similar structures.

Referring now to FIG. 1, a computer system 100 using the
present invention will now be described. The computer
system 100 has access to a database 102 which is queried by
a database query engine 104 in response to a search query
106. In the present invention, a database is any collection of
information and contains several items. Each of the items in
the collection may be compared to a search query to determine
whether the item matches the search query. The
collection of information may be the Internet, a similar
network having a collection of documents, or a private
structured database or any other searchable entity. Such a
database typically includes an index representing each item
in the collection of information in order to simplify the
search process. In some cases, such as with a search engine
for the World Wide Web, or the Internet, the index is
accessed by the query engine and the actual documents to be
accessed using the results of a query are from a third party
source.

A user supplies the search query 106 to the query engine
104 through a user interface 108. The database query engine
104 applies the search query 106 to the database 102 to
provide search results 110 which include an indication of the
items in the database 102 which match the search query 106.
The search results typically include enough information to
access the actual item, but generally does not include the
entire item in order to reduce the amount of memory needed
to process the search results. In the invention, a relevance
determination module 112 receives the search results 110
from the database query engine 104 and applies pre-specified
relevance factors 114 to each of the corresponding
items in the search results 110 to obtain scored search results
116. In particular, each of the items in the search results 110
has a set of attributes associated with it, which the module
112 may use the database 102 to access and identify if such
information is not made available in the search results 110.
Each of these attributes is given a weight according to the
specified relevance factors 114. These weights are combined
to provide a score for each item. The scored search results
are sorted by sorting module 118 to provide ranked results
120 which are provided to a user interface 122 to be output
to the user.

Another embodiment is shown in FIG. 2. In this computer
system 130, the search results 110 do not include a score
with each item. Therefore, the relevance determination
module 128 outputs scores 124 separately for each item in
the search results. Both the search results 110 and the list of
scores 124 are used by the sorting module 124 to produce
ranked results for the user. The embodiment is otherwise the
same as shown in FIG. 1.

The modules 108, 104, 102, 112, 118 and 122 in FIGS.
1-2 may be implemented using one or more general purpose
computers which execute an application program written in
a computer program language.

The computer system 100 may be one or more general
purpose computer systems which are programmable using a
high level computer programming language, such as "C," or
"Pascal." The computer system also may be implemented
using specially programmed, special purpose hardware. In a
general purpose computer system, the processor is typically
a commercially available processor, of which the series x86
processors, available from Intel, and the 680X0 series
microprocessors available from Motorola are examples.
Many other processors are available. Such a microprocessor
executes a program called an operating system, of which
UNIX, DOS and VMS are examples, which controls the
execution of other computer programs and provides
scheduling, debugging, input/output control, accounting,
compilation, storage assignment, data management and
memory management, and communication control and
related services. The processor and operating system define
a computer platform for which application programs in
high-level programming languages are written. It should be
understood the invention is not limited to a particular
computer platform, particular processor, or particular
high-level programming language. Additionally, the computer
system may be a multiprocessor computer system or may
include multiple computers connected over a computer
network. As such, the database may be local to the user or
remote.

A suitable computer system to implement the modules of
FIGS. 1 or 2 typically includes an output device which
displays information to a user. The computer system
includes a main unit connected to the output device and an
input device, such as a keyboard. The main unit generally
includes a processor connected to a memory system via an
interconnection mechanism. The input device is also connected
to the processor and memory system via the connection
mechanism, as is the output device.

It should be understood that one or more output devices
may be connected to the computer system. Example output
devices include a cathode ray tube (CRT) display, liquid
crystal displays (LCD), printers, communication devices
such as a modem, and audio output. It should also be
understood that one or more input devices may be connected
to the computer system. Example input devices include a
keyboard, keypad, track ball, mouse, pen and tablet,
communication device, audio input and scanner. It should be
understood the invention is not limited to the particular input
or output devices used in combination with the computer
system or to those described herein.

A memory system typically includes a computer readable
and writeable nonvolatile recording medium, of which a
magnetic disk, a flash memory and tape are examples. The
disk may be removable, known as a floppy disk, or
permanent, known as a hard drive. A disk has a number of
tracks in which signals are stored, typically in binary form,
i.e., a form interpreted as a sequence of ones and zeros. Such
signals may define an application program to be executed by
the microprocessor, or information stored on the disk to be
processed by the application program. Typically, in
operation, the processor causes data to be read from the
nonvolatile recording medium into an integrated circuit
memory element, which is typically a volatile, random
access memory such as a dynamic random access memory
(DRAM) or static memory (SRAM). The integrated circuit
memory element allows for faster access to the information
by the processor than does the disk. The processor generally
manipulates the data within the integrated circuit memory
and then copies the data to the disk when processing is
completed. A variety of mechanisms are known for manag-
ing data movement between the disk and the integrated
circuit memory element, and the invention is not limited
thereto. It should also be understood that the invention is not
limited to a particular memory system.

In one example embodiment, the user interface 108 may
be any suitable user interface for providing the search query
106 and relevance factors 114 to the database query engine
104. Such an interface includes, but is not limited to, a client
application program, commonly called a "browser,"
executed on a general purpose computer which communicates
over a computer network with an application program
executed on a server computer, called a "server," using
messages containing formatted data which the server parses
and provides to a database query engine. Examples of such
browsers include the Navigator browser from Netscape
Communications, Inc., and the Internet Explorer browser
from Microsoft Corporation. These browsers present documents
defining a form which can be completed by a user to
include the search query 106 and relevance factors 114. An
example display for a user interface in one embodiment of
the invention will be described in more detail below in
connection with FIG. 5. In response to the user input, the
browser sends a message containing the search query and
relevance factors to a designated server which processes the
query. How such a user interface may be provided to allow
for user input of relevance factors will be described in more
detail below.

The user interface also may be a custom user interface
provided by either a private on-line computer service, of
which LEXIS/NEXIS online service and WestLaw online
service are examples, or any other database system.

The database query engine 104 may be implemented
using a computer program, to be executed on the server
computer or another general purpose computer, which
implements some techniques for performing database
queries, of which several are known. For example, the
database query engine may be a program associated with an
HTTP server, such as the HTTP server available from
Netscape Communications, Inc., called the Netscape Enterprise
server. Such a server has an application programming
interface (API) which enables other computer programs to
be connected to and accessed through this server to perform
various functions, including database queries. Other
example database query engines include those provided
through a variety of private on-line services and
commercially-available database systems as described
above.

The user interface for output 122 may be the same as the
user interface 108, or may be another mechanism, such as a
printer, electronic mail, data file, or some other source of
data which may be accessed by a user.

As will be understood from the foregoing, elements
102-108 and 120 may be any of a variety of kinds of systems
for performing database queries that are well known in the
field. In addition, it should be understood that the various
modules shown in FIGS. 1 to 2 may be implemented,
combined and/or integrated in a variety of different ways. In
particular, the coordination of the transfer of data between
the modules may be performed in any desired manner.

FIG. 3 shows a particular embodiment of the invention
which uses a browser as described above which presents
HTML documents to a user as shown at 150 in FIG. 3.
The browser can both receive input from a user and provide
output as indicated at 152 and 154, respectively. In this
embodiment, the user-provided search query 156 and relevance
factors 158 are sent to a server 160 such as an HTTP
server, examples of which were described above. The HTTP
server 160 has an interface through which a query 162,
derived from the input 156 from the browser 150, can be
provided to a database query engine 164. Similarly, relevance
factors 166 derived from the input relevance factors
can be provided to a relevance determination module 168.
The database query engine 164 provides search results
170 to the relevance determination module. In addition, the
relevance determination module may access the database
172. This relevance determination module 168 then provides
the scored results 174 which are provided through a sorting
module 176. The ranked results 178 provided by the sorting
module are formed into an HTML document which is
returned to the browser 150 via the HTTP server 160 as
indicated at 180.

Having described the general environment in which the
present invention may be used and a particular embodiment
thereof, the application of user-defined relevance factors 114
to search results 110 by a relevance determination module
112 will now be described in more detail.

First, the kinds of relevance factors that may be used will
be described. A relevance factor is a value associated with an
attribute which an item in a database may have that either
other items in the database might not have to the same
degree (where the attribute may have a range of values), or
which other items in the database might not have at all
(where the attribute is either present or not). For example,
whether a document contains a particular word is an
attribute of a document. A date associated with a document
may be an attribute. The location of a document in a directory
in a file system, the size of a document, and other features
may all be attributes.

A few examples of relevance factors and their associated
attributes for documents will now be described in more
detail. One relevance factor is the location of a search term
in the document, or the field that contains the search term.
For example, if a search term occurs in the title of a
document, that document may be more relevant than a
document in which the search term appears in a footnote. If
a structured document is being used, such as a document in
the standard generalized markup language (SGML) or one
of several document types, such as documents in the hypertext
markup language (HTML), the structural information
about a document may be used to give more or less weight
to a term depending upon the enclosing tags. For example,
a search term which appears inside a <TITLE></TITLE>
tag pair might be given a greater relevance weighting than
the same word in the same document but in normal body
text.

Another relevance factor is the position of search terms in
the document, called the salience of the search term. This
factor relates to the position of a word in a specific field of
a given document. For example, a search term appearing in
the first sentence of the first paragraph of a field will have
greater salience to the field search than the same term found
in the last sentence of the last paragraph of the same field.

Another relevance factor is the frequency of occurrence of
a search term in the document. The number of times a word
appears in a document relative to the number of all words in
the document can indicate the relevance of a document. For
example, a long document that uses the word "Clinton" a
few times is probably less relevant to a search for "Bill
Clinton" than a document of the same length that uses the
word "Clinton" many times.

Another relevance factor is the frequency of occurrence of
a search term in all documents. The number of times a word
appears in the collection of documents relative to the total
size of that collection affects the relevance of a term to a
specific document. This is referred to as TFIDF weighting,
for Term Frequency Inverse Document Frequency weighting.

Another relevance factor is the number of search terms
found in the document. For example, if the user enters a
query that has six search terms, then documents which
contain all six search terms generally are considered more
relevant than documents which contain only five of the six
search terms. This relevance factor is particularly useful in
calculating the relevance of logical OR searches.

Another relevance factor is the ordering of search terms in
the document. That is, if the query terms appear in their
given order in a document, then a relevance bonus may be
applied to the document. For example, if the query entered
is "Bill AND Clinton," then documents which preserve this
order will be considered more relevant than documents in
which the word "Clinton" preceded "Bill" in the document.
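The ordering test can be sketched in C as follows. This is only an illustration: the function name order_bonus, the representation of a document as an array of words, and the all-or-nothing bonus are assumptions made for the example and are not taken from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns `base` if the query terms occur in the
 * document as a subsequence in query order, and 0 otherwise. */
int order_bonus(const char *const *doc, size_t doc_len,
                const char *const *query, size_t query_len, int base)
{
    assert(base >= 0);
    size_t next = 0;          /* search resumes after the previous match */
    for (size_t q = 0; q < query_len; q++) {
        size_t i = next;
        while (i < doc_len && strcmp(doc[i], query[q]) != 0)
            i++;
        if (i == doc_len)
            return 0;         /* term missing, or found only out of order */
        next = i + 1;
    }
    return base;
}
```

With the query terms "Bill" and "Clinton", a document containing "Bill Clinton spoke" would receive the bonus under this rule, while "Clinton met Bill" would not.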

Another relevance factor is the pairwise distance between
search terms in a document. In other words, the closer
together the search terms appear in the document, the higher
the relevance bonus may be. For example, if the query is
"Bill" and "Clinton," then documents which contain "Bill"
and "Clinton" next to each other will rank higher than
documents in which "Bill" and "Clinton" are separated by
intervening words.
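One possible way to turn pairwise distance into a bonus is sketched below in C. The description does not give a formula, so the rule used here (full base for adjacent terms, base divided by distance otherwise), the function names, and the use of word offsets as positions are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest absolute distance, in words, between any occurrence of term A
 * and any occurrence of term B. Positions are word offsets in the
 * document; both arrays must be non-empty. */
size_t min_distance(const size_t *pos_a, size_t n_a,
                    const size_t *pos_b, size_t n_b)
{
    assert(n_a > 0 && n_b > 0);
    size_t best = (size_t)-1;
    for (size_t i = 0; i < n_a; i++) {
        for (size_t j = 0; j < n_b; j++) {
            size_t d = pos_a[i] > pos_b[j] ? pos_a[i] - pos_b[j]
                                           : pos_b[j] - pos_a[i];
            if (d < best)
                best = d;
        }
    }
    return best;
}

/* Assumed bonus rule: adjacent terms (distance 1) earn the full base,
 * and the bonus shrinks as base / distance for terms farther apart. */
int proximity_bonus(int base, size_t distance)
{
    assert(base >= 0);
    if (distance == 0)
        return base;          /* same position; treated as adjacent */
    return (int)((size_t)base / distance);
}
```

A non-linear falloff, or a cutoff beyond some maximum distance, would fit the description equally well.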

Another relevance factor is related to the length of search
words and is based on stemming. This factor is important if
word stemming is supported in the search engine. Word
stemming is a way of expanding the number of search terms
by applying a series of suffixes to a base search term. For
example, if a search term is "Bill," when stemming is
employed the search engine might also search for "Bills,"
"Billion," etc. Of these, the original search term, "Bill," will
be considered more relevant than the other examples,
whereas the term "Bills" will be considered to be more
relevant to the search than "Billion." Other longer stemmed
extensions correspondingly are less relevant.
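The ranking of "Bill" over "Bills" over "Billion" can be reproduced with a simple length ratio, sketched below in C. The function name prefix_bonus and the scaling rule (base multiplied by the fraction of the matched word covered by the search term) are assumptions chosen for illustration; the comparison here is case-sensitive.

```c
#include <assert.h>
#include <string.h>

/* Assumed rule: scale the base by the fraction of the matched word that
 * the original search term covers. Returns 0 if `term` is not a prefix
 * of `matched`. */
int prefix_bonus(int base, const char *term, const char *matched)
{
    size_t term_len = strlen(term);
    size_t word_len = strlen(matched);
    assert(base >= 0);
    if (term_len == 0 || strncmp(term, matched, term_len) != 0)
        return 0;
    /* word_len >= term_len > 0 here, so the division is safe */
    return (int)((size_t)base * term_len / word_len);
}
```

With a base of 10, the exact match keeps the whole base, and each longer stemmed form receives a smaller share of it.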

Default values for the relevance factors used in any
particular section may be stored as global variables of the
database or the database query engine or the relevance
determination module. The following table sets forth an
example of name, data type and default values for the
foregoing relevance factors, and a description of each.



Name                Max    Default   Description

weight_word_match   1000   1000      Base for the number which is
                                     added for each word from the
                                     query matching for a record
weight_tfidf                         Base for the Term-Frequency
                                     Inverse-Document-Frequency
                                     calculations
weight_field                         Base for field bonuses applied
                                     with the field-configuration file
weight_position                      Base for the word position
                                     within a field. Words closer
                                     to the front of the field receive
                                     a higher bonus.
weight_proximity                     Base for the bonus based on two
                                     words of distance from each
                                     other. Words closer together
                                     receive a higher bonus
weight_order        20     2         Base for the bonus based
                                     on word order. Words in a
                                     document in the
                                     same order as the search
                                     receive a bonus
weight_prefix       20     10        Base for the bonus based on
                                     word prefix size for word
                                     stemming

[The Max and Default entries for weight_tfidf through
weight_proximity are not legible; the values 100, 90 and 15
survive without a recoverable row assignment.]


These default values are useful as a starting point when 
presenting a user with an interface for adjusting the rel- 
evance factors. For example, in the embodiment of FIG. 3, 
a document may be prepared for display to the user based on 
these variables. The user may then manipulate several 
parameters of the user input interface to vary the relevance 
factors. Each of these factors is defined as a parameter which 
is associated with a value. It should be understood that 
additional parameters easily can be added and that the 
invention is not limited to the parameters shown or any 
subset thereof. "MAXINT" is the maximum integer value
supported by the relevance determination module which is
2^n-1, where n is the number of bits used to represent an
integer. The table below illustrates the parameters in one 
embodiment of the invention. 



PARAMETER   DESCRIPTION               POSSIBLE VALUES

rt          enable relevance tuning   1, yes, true
rtwm        word match                0->MAXINT
rttf        TFIDF                     0->MAXINT
rtfd        field                     0->MAXINT
rtpn        position                  0->MAXINT
rtpy        proximity                 0->MAXINT
rtor        order                     0->MAXINT
rtpx        prefix                    0->MAXINT



In the embodiment shown in FIG. 3, the user may submit
a query by inputting values through a form or other interface
in the browser, which are converted into the form of a
uniform resource locator (URL) by the database query
engine. The standard form of a URL includes an indication
of a protocol, a host, a filename and parameters, separated by
delimiters, as follows:

protocol://host/filename?parameter1=value1&parameter2=value2.

The relevance factors would be used as parameters
included in the URL separated by an ampersand (&) delimiter.
As an example, to enable relevance tuning on a query
and to set the "order" weight to 100, the following would be
submitted:

http://host/cgi-bin/query_program?query_terms&rt=1&rtor=100

where query_program is the program that, when
executed, is the database search engine, and query_terms are
the search terms.
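A server-side routine that pulls the tuning parameters out of such a request might look like the following C sketch, assuming a query string that reads query_terms&rt=1&rtor=100. The tuning_t structure and the parse_tuning function are hypothetical names introduced for the example; only the parameter names rt, rtor and rtpy come from the parameter table, and leaving unset parameters at zero is a simplification.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for a few of the tuning parameters. */
typedef struct {
    int  enabled;     /* rt:   relevance tuning on or off */
    long order;       /* rtor: weight for the order factor */
    long proximity;   /* rtpy: weight for the proximity factor */
} tuning_t;

/* Parses the part of the URL after '?'. The buffer is modified in place.
 * Items without '=' (such as the search terms) and unknown parameter
 * names are ignored in this sketch. */
void parse_tuning(char *query_string, tuning_t *out)
{
    assert(query_string != NULL && out != NULL);
    out->enabled = 0;
    out->order = 0;
    out->proximity = 0;
    for (char *pair = strtok(query_string, "&"); pair != NULL;
         pair = strtok(NULL, "&")) {
        char *eq = strchr(pair, '=');
        if (eq == NULL)
            continue;
        *eq = '\0';                       /* split "name=value" */
        const char *value = eq + 1;
        if (strcmp(pair, "rt") == 0)
            out->enabled = strcmp(value, "1") == 0 ||
                           strcmp(value, "yes") == 0 ||
                           strcmp(value, "true") == 0;
        else if (strcmp(pair, "rtor") == 0)
            out->order = strtol(value, NULL, 10);
        else if (strcmp(pair, "rtpy") == 0)
            out->proximity = strtol(value, NULL, 10);
    }
}
```

A fuller version would presumably fall back to the default weights for parameters that are absent and would decode URL escapes before comparing values.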

It should be understood that any other form of message
that contains the search terms and relevance factors may be
used to communicate them to the database query engine and
that the invention is not limited to any particular form. The
user also may specify a kind of search for which the weights
associated with the relevance factors are predetermined.

M(j. 5 illustrates an example graphical user interface 
through which a user may input various values for the 
releyance factors. The relevance factors shown in this inter- 
face "include the word match 300, frequency (TFIDF) 302, 
field 304, position 306, proximity 308 and order 310 factors. 
The user manipulates a button on a slider bar (e.g., button 



10/08/2003, EAST Version: 1.04.0000 



6,012,053 



10 



typedef struct { 






unsigned int 


crror_chcck; /* an error checking value 






to verify struct */ 


unsigned int 


flags; 


/• associated bit-flags "/ 


crid_t 


end; 


/* record identifier of this 






item in the catalog */ 


unsigned int 


match_n; 


/* number of words that 






matched */ 


rel_t 


relevance; 


/* metric from 1 (low) to 






MAXINTT (high) */ 


cat_Jd_t 


catalog id 


/* identifier of the catalog 


} result_t; 




containing this item */ 







In this data structure, qr_error__check is an error check- 
ing value used to verify the structure. 

flags: this value naay be used to indicate any additional 
information associated with this result entry. 

crid: this value is the catalog record identifier which 
references the database record of the document. 



10 



312 ) to adjust the value for the factor. The corresponding 
value set by the user (corresponding to the slider button 
position, that is) is displayed at a box such as 314. A region 
316 of the interface allows the user to input a search query. 
Such an interface may be created, for example, by appro- 
priate programming using the Java programming language. 
Other interfaces may be created by using HTML forms to 
allow a user to type in a value or to select a value from the 
menu. 

An example embodiment using an HTML form is shown 
in FIG. 6. In FIG, 6, the embodiment docs not use a 
Java-implemented interface. The search input panel 320 is 
similar to panel 316 in FIG. 5, except a drop-down menu 322 
allows a user to specify a kind of search, such a specifying 
finding all of the words or any of the words. The embodi- 
ment also may allow the user to specify finding the exact 
phrase, or performing natural language query or the speci- 
fied boolean expression. 

In the embodiment shown in FIG. 5, from the user's 
perspective all values for the relevance factors are in the 
range of zero to 100. Such an interface may be more intuitive 
to a user than an interface that uses the actual range of 20 
weight values because the relative importance of a factor 
may be displayed. For instance, if the position factor is 
assigned the vaJue 100 and the proximity factor 50, then the 
position of words in a document is twice as important as 
their closeness together. The input value is then mapped to ^ 
the range of values for the weights actually used by the 
relevance determination module described in the table 
above. This mapping may be cither linear or non-linear. 

As is commonly done with search queries in general, in 
the embodiment shown in FIG. 3 the search terms and other 
parameters in a URL are processed on the server side by 
parsing the URL, The search terms and relevance factors 
extracted from the URL are then formed into respective data 
structures which are used, respectively, by the database 
query engine and by the relevance determination module. 
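
A server-side parse of this kind might look like the sketch below. It is a minimal illustration, not the described implementation: the parameter names ("q" for the search terms and one parameter per relevance factor), the structure names and the buffer size are all assumptions, and percent-decoding is deliberately left out.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_QUERY_LEN 256  /* hypothetical limit on the search-term text */

/* Relevance factors as entered in the interface (0..100), before mapping. */
typedef struct {
    int word_match, tfidf, field, position, proximity, order, prefix;
} ui_factors;

typedef struct {
    char terms[MAX_QUERY_LEN];  /* search terms, with '+' decoded to spaces */
    ui_factors factors;         /* input for the relevance determination module */
} parsed_request;

/* Return the slot for the factor named key[0..key_len), or NULL if unknown. */
static int *lookup_factor(ui_factors *f, const char *key, size_t key_len)
{
    static const char *const names[] = {
        "word_match", "tfidf", "field", "position", "proximity", "order", "prefix"
    };
    int *slots[] = {
        &f->word_match, &f->tfidf, &f->field, &f->position,
        &f->proximity, &f->order, &f->prefix
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (strlen(names[i]) == key_len && strncmp(names[i], key, key_len) == 0)
            return slots[i];
    return NULL;
}

/* Parse "key=value&key=value..." (the part of the URL after '?').
 * Returns 0 on success, or -1 if no "q" parameter was present. */
int parse_request(const char *query_string, parsed_request *out)
{
    const char *p = query_string;
    int have_terms = 0;

    memset(out, 0, sizeof *out);
    while (*p != '\0') {
        const char *amp = strchr(p, '&');
        size_t pair_len = amp ? (size_t)(amp - p) : strlen(p);
        const char *eq = memchr(p, '=', pair_len);

        if (eq != NULL) {
            size_t key_len = (size_t)(eq - p);
            const char *val = eq + 1;
            size_t val_len = pair_len - key_len - 1;

            if (key_len == 1 && p[0] == 'q') {
                size_t i, n = val_len < MAX_QUERY_LEN - 1 ? val_len : MAX_QUERY_LEN - 1;
                for (i = 0; i < n; i++)
                    out->terms[i] = (val[i] == '+') ? ' ' : val[i];
                out->terms[n] = '\0';
                have_terms = 1;
            } else {
                int *slot = lookup_factor(&out->factors, p, key_len);
                if (slot != NULL)
                    *slot = (int)strtol(val, NULL, 10);  /* stops at '&' or end */
            }
        }
        p += pair_len;
        if (*p == '&')
            p++;
    }
    return have_terms ? 0 : -1;
}
```

The two members of parsed_request correspond to the two data structures mentioned above: the terms go to the database query engine, and the factor values go to the relevance determination module.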

The actual form of the query, its representative data 
structure, how the query is performed and how results are
returned involve common techniques known in the art. For 
the purposes of understanding and illustrating the present 
invention, a query typically returns an array, list or other data 
structure containing records, or other data structures, which 
indicate each record in the database that matches the query. 
Such records typically include an identifier of the database, 
if more than one database was searched, and an identifier of 
the record in the database. 

An example data structure for returning a single record 
about a single document that meets a user's query is 
described below. An array of these data structures is typi- 
cally returned by the query engine. 






match_n: this value represents the number of terms that 
were matched in this record. This value is reflected as part 
of the relevance score. 

relevance: this value represents the metric for relevance. 
A larger number indicates that the record is more relevant to 
the query. The value initially may be zero as returned by the 
query engine, or the query engine may initialize the value 
according to the word match relevance factor. After pro- 
cessing by the relevance determination module, this value is 
the final relevance value used by the sorting module 
described below. 

catalog_id: this value is the numerical identifier of the 
catalog or database in which this match was found. 

A sample output from a query engine may be the follow- 
ing: 

{error_checking, 0, 12341, 4, 0, 2},

{error_checking, 0, 145, 1, 0, 1},

{error_checking, 0, 10341245, 3, 0, 1},

{error_checking, RESULT_END}
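
One way to read the sample output is sketched below. The layout is an assumption inferred from the sample rows and the field descriptions: the names check and status for the first two entries, the numeric values behind error_checking and RESULT_END, and the integer types are hypothetical, and the actual record definition may differ.

```c
#include <stddef.h>

/* Hypothetical values: the text shows only the symbolic names. */
#define error_checking 0x4552L
#define RESULT_END     (-1)

typedef struct {
    long check;       /* consistency marker (assumed meaning of error_checking) */
    int  status;      /* 0 for a match, RESULT_END for the terminator (assumed) */
    long crid;        /* catalog record identifier */
    int  match_n;     /* number of query terms matched in this record */
    long relevance;   /* relevance metric; zero until scored */
    int  catalog_id;  /* catalog or database in which the match was found */
} search_result;

/* The sample output from the text, expressed with the assumed layout. */
static const search_result sample_results[] = {
    { error_checking, 0,          12341,    4, 0, 2 },
    { error_checking, 0,          145,      1, 0, 1 },
    { error_checking, 0,          10341245, 3, 0, 1 },
    { error_checking, RESULT_END, 0,        0, 0, 0 }
};

/* Count the records that precede the RESULT_END terminator. */
size_t count_results(const search_result *results)
{
    size_t n = 0;

    while (results[n].status != RESULT_END)
        n++;
    return n;
}
```

Because the array is terminated by a sentinel record rather than accompanied by a length, any consumer of the results walks the array until it meets RESULT_END, as count_results does.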

Each of the fields in the data structure for the relevance
factors represents a weight which is used to increase or 
decrease the bonuses given for the corresponding attribute 
for each document. An example data structure for the 
relevance factors defined in the C programming language is 
the following: 



typedef struct {
    rel_t word_match;  /* for each word matching in the document */
    rel_t tfidf;       /* weight for tfidf ratio */
    rel_t field;       /* field bonus: title, body, etc. */
    rel_t position;    /* position of word in a field */
    rel_t proximity;   /* closeness of two words */
    rel_t order;       /* word order */
    rel_t prefix;      /* prefix distance for stemming */
} weights;




In this data structure, the fields have the following
meanings:

word_match: This weight corresponds to whether a term
in the query occurs in a document. For example, if each term
in a three term query occurs in a document, regardless of the
number of times the terms occur, the document receives a
partial score for this factor of three times the weight assigned
to this factor. In general, this value should be much higher
than the others because documents that have more of the
search terms should be greatly rewarded. By default, some
query functions already sort documents based solely on the
number of terms matching in a document. This feature may
be overridden in some systems to allow documents to be
sorted on a relevance basis in the invention.

tfidf: This weight corresponds to the Term-Frequency
Inverse Document-Frequency value. As discussed above, in
general, TFIDF is a metric which compares the frequency of
a term in a document to how frequently the term
occurs in a corpus.

field: This weight corresponds to the field in which a 
search term occurs. For example, a term occurring in a title 
field probably should result in a relevance bonus higher than 
a term occurring in the document body. 

position: This weight corresponds to the position of a term 
in a field. A term receives a bonus if it appears closer to the 
front of a field. Terms closer to the front of a document
usually are more indicative of the subject of the document
and therefore are more relevant.

order: This weight corresponds to whether two search
terms occur in order in the document. If the search query is
"word-a AND word-b" and word-a is before word-b in a
document, a bonus is applied.

prefix: This weight corresponds to the difference in length,
in characters, between the search term and a term in a
document. For example, if the search term is "dog" and the
document has "dogs", which has one extra character, the
document should be more relevant to the search than a
document with "dogging," which has four extra characters.
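
The factor definitions above can be illustrated with a few helper functions, using the word-match example given here and the position and proximity formulas from the detailed example in this description. This is a sketch under stated assumptions, not the described implementation: rel_t is assumed to be floating point, positions are assumed to be 1-based word offsets, "the difference between the maximum value and the computed distance, less one" is read as max - (distance - 1), the order bonus is assumed not to depend on the distance threshold, and all function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef double rel_t;  /* assumption: the text does not define rel_t */

/* word_match: the weight times the number of query terms found. */
rel_t word_match_bonus(rel_t weight, int matched_terms)
{
    return weight * (rel_t)matched_terms;
}

/* position: (total + 1 - position) / total, scaled by the weight, so a term
 * at the very front of the field or document earns the full weight. */
rel_t position_bonus(rel_t weight, long position, long total_words)
{
    if (total_words <= 0 || position < 1 || position > total_words)
        return 0.0;
    return weight * (rel_t)(total_words + 1 - position) / (rel_t)total_words;
}

/* proximity and order for one pair of query terms, given the positions of
 * every instance of each term. Term a is the one written first in the query. */
rel_t proximity_order_bonus(rel_t proximity_weight, rel_t order_weight,
                            const long *pos_a, size_t n_a,
                            const long *pos_b, size_t n_b,
                            long max_distance)
{
    long best = -1;    /* minimum distance found so far, -1 if none */
    int in_order = 0;  /* whether the closest pair has a before b */
    rel_t bonus = 0.0;
    size_t i, j;

    for (i = 0; i < n_a; i++) {
        for (j = 0; j < n_b; j++) {
            long diff = pos_b[j] - pos_a[i];
            long dist = diff < 0 ? -diff : diff;
            if (best < 0 || dist < best) {
                best = dist;
                in_order = diff > 0;
            }
        }
    }
    if (best >= 1 && best < max_distance)
        bonus += proximity_weight * (rel_t)(max_distance - (best - 1))
                 / (rel_t)max_distance;
    if (best >= 1 && in_order)
        bonus += order_weight;
    return bonus;
}

/* prefix: number of extra characters in a document term that begins with the
 * search term, or -1 if the search term is not a prefix of the document term. */
long prefix_extra_chars(const char *search_term, const char *doc_term)
{
    size_t s_len = strlen(search_term);
    size_t d_len = strlen(doc_term);

    if (d_len < s_len || strncmp(search_term, doc_term, s_len) != 0)
        return -1;
    return (long)(d_len - s_len);
}
```

For a three term query in which every term occurs, word_match_bonus returns three times the weight, as in the example above, and prefix_extra_chars("dog", "dogs") returns 1. How the extra-character count is converted into a bonus is not specified in this passage, so the sketch stops at the count.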

Given the relevance factors and the search results, each
item which matches the query is given a score according to
the relevance factors. In order to perform this scoring, the
record for the item in the database is analyzed to determine
whether its attributes match the criteria for the factor in order
to receive the weight associated with the factor. The
information needed to determine the bonus to be applied typically
is readily available in an indexed database since the index is
needed to perform the query in the first place. Such
information also may be provided in the search results.
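
The scoring pass can be sketched as a loop that sums one bonus per relevance factor into each record's relevance slot. The sketch rests on assumptions: the record holds only the fields named in this description, rel_t is assumed to be floating point, the factor_fn callback type and all function names are hypothetical, and the sort is included only to show how the stored score would later be used for ranking.

```c
#include <stddef.h>
#include <stdlib.h>

typedef double rel_t;  /* assumption: the text does not define rel_t */

typedef struct {
    long  crid;        /* catalog record identifier */
    int   match_n;     /* number of query terms matched */
    rel_t relevance;   /* slot that receives the summed score */
    int   catalog_id;  /* catalog or database identifier */
} scored_record;

/* Hypothetical callback: the bonus one factor contributes to one record. */
typedef rel_t (*factor_fn)(const scored_record *rec, rel_t weight);

/* For each record, determine the bonus for every factor, sum the bonuses,
 * and store the sum in the record. */
void score_records(scored_record *recs, size_t count,
                   const factor_fn *factors, const rel_t *factor_weights,
                   size_t n_factors)
{
    size_t i, f;

    for (i = 0; i < count; i++) {
        rel_t sum = 0.0;
        for (f = 0; f < n_factors; f++)
            sum += factors[f](&recs[i], factor_weights[f]);
        recs[i].relevance = sum;
    }
}

/* qsort comparator: larger relevance first. */
static int by_relevance_desc(const void *a, const void *b)
{
    const scored_record *ra = a;
    const scored_record *rb = b;

    if (ra->relevance > rb->relevance)
        return -1;
    if (ra->relevance < rb->relevance)
        return 1;
    return 0;
}

/* Order the scored records from most to least relevant. */
void rank_records(scored_record *recs, size_t count)
{
    qsort(recs, count, sizeof recs[0], by_relevance_desc);
}

/* Example factor: the word-match bonus, weight times terms matched. */
rel_t word_match_factor(const scored_record *rec, rel_t weight)
{
    return weight * (rel_t)rec->match_n;
}
```

Passing the factors as an array of callbacks keeps the loop independent of any particular set of relevance factors; the query itself is never touched by this pass.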

One embodiment of a technique for determining the
relevance score for each document will now be described in
connection with FIG. 4. This embodiment assumes that a list
or array of search results, identifying database records, has
been received. The first step, 200, is obtaining the next
record to be analyzed from the search result. The relevance
value for each relevance factor is then determined in step
202. This determination will vary for each factor, as will be
described below. The bonuses determined for all of the
relevance factors are then summed in step 204. This sum is
inserted in the search results record in step 206 where this
record contains a slot for the score of each item as shown
above. If all the records have been analyzed as determined
in step 208, the process is completed; otherwise, the process
is repeated for the next record in the search results, in step
200.

Determination of the bonus for each relevance factor, step
202, will now be described by way of example. Since there
are a variety of ways to compute a bonus value for a
document for each relevance factor, the invention is not
limited to the following example. While this example is
provided for text documents, it should be understood that the
invention is not limited thereto.

Generally speaking, where the attribute of the document
is either present or not, such as whether a search term occurs
in the document, the bonus may be applied to the document
simply according to the presence or absence of this attribute.
For example, for every word in the search query which
occurs in the document, the weight corresponding to this
relevance factor is multiplied by the number of matched
terms to produce the bonus. On the other hand, where the
attribute corresponding to the relevance factor is a range of
values, there are several approaches for determining the
ultimate bonus. For example, the attribute may be converted
into a fraction which is multiplied by the weight
corresponding to the relevance factor to obtain the bonus.

A specific formula for determining the bonus for a
document corresponding to the relevance factors illustrated in
FIG. 5 will now be described. For the frequency or TFIDF,
field and position factors, the computation is performed for
each term in the search query. For the proximity and order
factors, the computation is performed for each pair of terms
in the query.

In order to compute the bonus corresponding to the
TFIDF factor, the ratio of the number of instances of a
search term in a document to the total number of instances
of all terms in a document is computed. Then, the total
number of instances of all terms in the catalog or database
is computed and its ratio to the total number of instances of
the search term in the catalog or database is computed. The
product of these two ratios is then determined. This product
is then multiplied by the TFIDF weight. The natural
logarithm of this product provides the bonus applied to the
document for this search term for this relevance factor. This
computation is performed for each term in the search query.

The field and position bonuses are determined together for
every word in the query. For a given word, the most relevant
field is identified first. This most relevant field can be
determined by ranking, in order of importance, the kinds of
fields in the various documents in the database. Each
document in the search results is searched to determine the most
important field in which the search term appears. In one of
the embodiments, the title is the most important field. If a
search term appears in the title, the document is given a
certain bonus. The occurrence of the term in other less
important fields is given an increasingly lower bonus. The
result of this computation is a value corresponding to the
type of the identified field, multiplied by the weight
corresponding to this relevance factor. This product is added to
the total score for the item. A position value also is computed
for the instance of the search term in the identified most
relevant field. This position value may be either the absolute
position in the document of this instance of this search term,
or the position value may be the position of this instance of
this search term in the identified field. Another bonus for the
document is then determined by subtracting this minimum
position value from the total number of instances of all words
in the document or the identified field, plus one. The resulting
difference is divided by the total number of instances of all
words in the document or the identified field. That quotient
is multiplied by the weight corresponding to this relevance
factor. The resulting value is added to the score for the
document.

The proximity and order bonuses may be determined
together for any given pair of words in a query. Given a pair
of words, a list of all of the instances of that word pair in the
document is obtained; typically this data can be
obtained readily from the index. This list of instances should
include an indication of the position of the instance of the
word in the document. A distance is computed between
every instance of one word and every instance of the other
word in its pairing. The minimum distance is retained. If this
distance is below a predetermined maximum distance, then
a bonus is given to the document. This bonus is computed by
determining the difference between the maximum value and
the computed distance, less one. This difference is divided
by the maximum value. The resulting quotient is multiplied
by the weight for this relevance factor. If the two
corresponding instances of the two words occur in order as they
appear in the search query, the weight for the order relevance
factor also is added to the score for this document.

It should be understood that there are many other ways to
apply relevance factors to search results, and that there are
many other relevance factors that may be used. Accordingly,
the invention is not limited to a particular set of relevance
factors or to a specific method or methods for applying them.

After processing by the relevance determination module
as described above, the array of search results may appear as
follows, where the last entry has a value "RESULT_END"
that indicates the end of the array.

{error_checking, 0, 12341, 4, 12006, 2},

{error_checking, 0, 10341245, 1, 673, 1},

{error_checking, 0, 145, 9013, 3, 1},

{error_checking, RESULT_END}

Having now described how the results are given a
relevance score, there are many ways in which the score may
be combined with the search results to provide meaningful,
ranked results to the user. The sorting module generally
processes the array of scored search results to sort the array,
using known techniques, and to generate an output to be
provided to a user that includes an indication of the
documents that matched the search query again using known
techniques. Such a document may include an indication of
the database record and its associated document, and
possibly its score, and preferably provides a way to access the
document.

An example result is shown in FIG. 7. In this
embodiment, the scores are shown for each item, but in other
embodiments, such scores may be omitted. This search is the
result of the query shown at 320 in FIG. 6. Each item
includes a hypertext link 330 to the source of the document,
a descriptor 332 of the document (usually text taken from the
beginning of the document), an indication 334 of the source
of the document and an indication of its score, as a function
of the maximum score of the retrieved items. FIG. 8
illustrates results achieved with the same query when the
relevance factor is the order of the search terms, set at a value
of 100. FIG. 9 illustrates the results achieved with the same
query when the selected relevance factors are words match,
proximity and field, with values set at 100, 100 and 10,
respectively. As can be seen from the results, the search
query and number of hits remains unchanged, but the
presentation of results differs.

By implementing a search engine in this manner, the user
can control the ranking and presentation of documents that
result from the search, based on the user's understanding of
the factors that may affect the relevance of the documents to
the query. In addition, the user can modify these factors
without modifying the query.

Having now described a few embodiments of the
invention, it should be apparent to those skilled in the art that
the foregoing is merely illustrative and not limiting, having
been presented by way of example only. Numerous
modifications and other embodiments are within the scope of one
of ordinary skill in the art and are contemplated as falling
within the scope of the invention as defined by the appended
claims and equivalent thereto.

What is claimed is:

1. A computer system for providing user-controllable
relevance ranking of search results from a query on a
collection of items of information, comprising:

a relevance determination module having a first input for
receiving a set of search results of a current search from
a query indicating items in the collection matching the
query, a second input for receiving relevance factors
input by a user through a graphical user interface, and
a third input for receiving information about the items
in the set of search results of the current search to which
relevance factors are applied to determine a score for
each of the items, and an output for providing an
indication of the score indicative of relevance for each
of the items in the set of search results; and

a sorting module which receives the score associated with
each item and an indication of the set of search results,
and an output providing the user an indication of the
items in the set of search results in an order ranked
according to the relevance score of each item.

2. The computer system according to claim 1, wherein the
query includes a search term and the relevance factors
include at least one of a group of relevance factors including:

(a) location of the search term in an item in the collection
of items;

(b) location of the search term in a field of the item;

(c) position of the search term in the item;

(d) frequency of occurrence of the search term in the item;

(e) frequency of occurrence of the search term in all items
of the collection of items;

(f) length of a term of an item that is a stem of the search
term;

(g) location of the item within a directory of files; and

(h) recency of the item.

3. The computer system according to claim 1, wherein the
query includes at least two search terms and the relevance
factors include at least one of a group of relevance factors
including:

(a) number of search terms found in a item;

(b) ordering of search terms in the item; and

(c) pairwise distance between the search terms in the item.

4. The computer system according to claim 2, comprising
a graphical user interface for collecting relevance factor
information from a user to produce the indication, wherein
the graphical user interface includes a sliding scale
corresponding to each relevance factor that is adjusted by the user
to assign a weight to the corresponding relevance factor.

5. The computer system according to claim 1, wherein the
relevance determination module does not alter the query
performed on the collection of information.

6. The computer system according to claim 1, wherein
each relevance factor is assigned a base value that is
specified by the user, wherein the base value corresponds to
a weight of the corresponding relevance factor.

7. A computer-implemented method for providing
user-controllable relevance ranking of search results of a current
search from a query on a collection of items of information,
comprising steps of:

receiving relevance factors input by a user through a
graphical user interface;

receiving one or more search terms from a user;

performing the query using the one or more search terms
producing a set of search results of the current
search;

indicating, in the search results, items in the collection
matching the query;

receiving information about the items in the set of search
results of the current search to which the relevance
factors are applied to determine a score for each of the
items;

providing an indication of the score indicative of
relevance for each of the items in the set of search results;
and

providing to the user an indication of the items in the set
of search results in an order ranked according to the
relevance score of each item.

8. The computer-implemented method according to claim
7, wherein the step of providing an indication of score
includes a step of totaling individual scores of relevance
factors.

9. The computer-implemented method according to claim
8, comprising a step of totaling scores of relevance factors
including at least one of a group of relevance factors
including:

(a) location of the search term in an item in the collection
of items;

(b) location of the search term in a field of the item;

(c) position of the search term in the item;

(d) frequency of occurrence of the search term in the item;

(e) frequency of occurrence of the search term in all items
of the collection of items;

(f) length of a term of an item that is a stem of the search
term;

(g) location of the item within a directory of files; and 

(h) recency of the item. 

10. The computer-implemented method according to 
claim 8, comprising a step of totaling individual scores of 
relevance factors including at least one of a group of 
relevance factors including:

(a) number of search terms found in a item; 

(b) ordering of search terms in the item; and 

(c) pairwise distance between the search terms in the item.

11. The computer-implemented method according to
claim 7, comprising a step of collecting relevance factor 
information from a user to produce the indication of the 
score, wherein the graphical user interface includes a sliding
scale corresponding to each relevance factor that is adjusted
by the user to assign a weight to the corresponding relevance 
factor. 

12. The computer system according to claim 7, including 
a step of assigning a base value, specified by the user, to each 
relevance factor, wherein the base value corresponds to a
weight of the corresponding relevance factor.



10/08/2003, EAST Version: 1.04.0000 



